What is an H-1B Visa

H-1B visas are a category of employment-based, non-immigrant visas for temporary foreign workers in the United States. For a foreign national to apply for an H-1B visa, a US employer must offer them a job and submit a petition for an H-1B visa to the US immigration department. This is also the most common visa status applied for and held by international students once they complete college or higher education and begin working in a full-time position.

Dataset Description

This dataset contains five years' worth of H-1B petition data, with approximately 3 million records overall. The columns in the dataset include case status, employer name, worksite coordinates, job title, prevailing wage, occupation code, and year filed.

Objective

The objective of this project is to analyze and gain further insight into the H-1B applications filed in the United States of America in the year 2016.

Data Preprocessing

The data set of three million rows is filtered down to seventy thousand rows to suit the project requirements. After filtering, it consists of eleven attributes for the year 2016.

The next major step in data preprocessing is handling null and N/A values and outliers. All rows with N/A or null values in the class attributes are removed, and outliers are dealt with.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##        35     57320     68240     89150     85180 329100000
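
The filtering and cleaning steps described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual code: the field names (`year`, `case_status`, `prevailing_wage`), the toy rows, and the 1.5 × IQR outlier rule are all assumptions, since the report does not state how outliers were handled.

```python
import statistics

# Hypothetical toy records standing in for the real petition data.
wages_2016 = [52000, 57000, 61000, 64000, 68000, 72000, 79000, 85000, 329100000]
records = [
    {"year": 2016, "case_status": "CERTIFIED", "prevailing_wage": w}
    for w in wages_2016
]
records += [
    {"year": 2015, "case_status": "CERTIFIED", "prevailing_wage": 61000},
    {"year": 2016, "case_status": None, "prevailing_wage": 64000},
    {"year": 2016, "case_status": "DENIED", "prevailing_wage": None},
]


def preprocess(rows, year=2016):
    """Filter to one year, drop rows with missing values, remove wage outliers."""
    # Step 1: keep only petitions filed in the target year.
    rows = [r for r in rows if r["year"] == year]
    # Step 2: drop rows where any attribute is missing.
    rows = [r for r in rows if all(v is not None for v in r.values())]
    # Step 3 (assumed rule): drop wages outside 1.5 * IQR of the quartiles.
    wages = [r["prevailing_wage"] for r in rows]
    q1, _, q3 = statistics.quantiles(wages, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [r for r in rows if low <= r["prevailing_wage"] <= high]


clean = preprocess(records)
print(len(records), "rows before,", len(clean), "rows after")
```

An extreme maximum like the one in the summary above pulls the mean well above the median, which is why some outlier rule is needed before wages are compared across groups.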

Categorical Separation of applicants into respective groups

Examine the salary distribution of applicants based on job status

Below are a box plot and a histogram of the prevailing wage (salary) distribution for each group of applicants.


Salary Distribution of applicants

Central Limit Theorem

The central limit theorem states that the distribution of sample means, computed from independent random samples of a given size, approaches a normal distribution as the sample size grows, even if the original population is not normally distributed. This is important because many statistical procedures assume normality, so techniques that assume normality can be applied to sample means even when the population is not normal. The Prevailing_wage attribute in this data set is used as an example to show the application of the central limit theorem. As displayed in the histogram above, the salary distribution of all applicants appears approximately normal. Below are histograms showing that the sample means of 1000 random samples of size 10, 20, 30, and 40 follow a normal distribution.

## Population, Mean = 70549.71 , SD = 19868.77
## Sample Size 10, Mean = 70588.63 , SD = 6283.057
## Sample Size 20, Mean = 70505.46 , SD = 4442.792
## Sample Size 30, Mean = 70551.13 , SD = 3627.525
## Sample Size 40, Mean = 70588.15 , SD = 3141.528
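
The standard deviations listed above match the population SD divided by the square root of the sample size (for example, 19868.77 / √10 ≈ 6283.06), which is the spread the central limit theorem predicts for sample means. The sketch below illustrates the same effect by simulation. It is written in Python with the standard library only, and it uses a synthetic right-skewed population as a stand-in; the exponential distribution, the seed, and the population size are assumptions, not the H-1B data.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Synthetic, right-skewed stand-in for a wage column (assumption).
population = [random.expovariate(1 / 70000) for _ in range(50_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)


def sample_means(data, sample_size, n_samples=1000):
    """Means of n_samples simple random samples of the given size."""
    return [
        statistics.fmean(random.sample(data, sample_size))
        for _ in range(n_samples)
    ]


results = {}
for n in (10, 20, 30, 40):
    means = sample_means(population, n)
    observed_sd = statistics.stdev(means)
    predicted_sd = pop_sd / math.sqrt(n)  # standard error of the mean
    results[n] = (statistics.fmean(means), observed_sd, predicted_sd)
    print(f"n={n}: mean of means={results[n][0]:.0f}, "
          f"observed SD={observed_sd:.0f}, predicted SD={predicted_sd:.0f}")
```

Even though the synthetic population is far from normal, the mean of the sample means stays close to the population mean and their spread shrinks roughly as 1/√n.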

Sampling

Sampling is a technique for selecting a representative portion of the population to perform a study on. There are many different sampling techniques, including simple random sampling, systematic sampling, and stratified sampling. Simple random sampling is a basic sampling technique where individual subjects are selected from a larger group. In this case, every subject has the same chance of being picked. Systematic sampling is a method where subjects are selected at a fixed periodic interval. The interval is calculated by dividing the population size by the desired sample size. The first subject is chosen randomly within the first interval. When the sample means are normally distributed, the sample mean can be used as an estimate of the population mean. Given a certain confidence level, a confidence interval is defined. The confidence interval is a range of values which contains the population mean with the given confidence level.

For this project the coding worker population will be analyzed. Simple random sampling without replacement, systematic sampling, and unequal probability sampling will be utilized as sampling methods.
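
As a rough illustration of the sampling methods described above, the sketch below implements simple random sampling (with and without replacement) and systematic sampling in Python using only the standard library. The function names, the toy population of 1000 wages, and the sample size of 100 are hypothetical; unequal probability sampling is left out of the sketch.

```python
import random


def srswor(population, n, rng):
    """Simple random sampling without replacement: no subject repeats."""
    return rng.sample(population, n)


def srswr(population, n, rng):
    """Simple random sampling with replacement: subjects may repeat."""
    return rng.choices(population, k=n)


def systematic_indices(population_size, n, rng):
    """Indices for systematic sampling: random start, then every k-th subject."""
    k = population_size // n      # interval = population size / sample size
    start = rng.randrange(k)      # random start within the first interval
    return [start + i * k for i in range(n)]


rng = random.Random(2016)  # seeded generator so the sketch is reproducible
# Hypothetical population of 1000 distinct wage values.
population = [40000 + 50 * i for i in range(1000)]

sample_wor = srswor(population, 100, rng)
sample_wr = srswr(population, 100, rng)
sys_idx = systematic_indices(len(population), 100, rng)
sample_sys = [population[i] for i in sys_idx]

for name, s in [("SRSWOR", sample_wor), ("SRSWR", sample_wr), ("Systematic", sample_sys)]:
    print(f"{name}: n = {len(s)}, mean = {sum(s) / len(s):.2f}")
```

Systematic sampling only behaves like a random sample when the population has no periodic pattern that lines up with the interval, so the row order matters for that method.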

Confidence Intervals of the mean

## Prevailing wage : mean = 70549.71  and sd = 19868.77
## 80% Conf Level (alpha = 0.20), CI = 45086.86 - 96012.57 
## 90% Conf Level (alpha = 0.10), CI = 37868.49 - 103230.93
## SRSWR : mean = 69082.13  and sd = 1986.877
## 80% Conf Level (alpha = 0.20), CI = 66535.84 - 71628.42 
## 90% Conf Level (alpha = 0.10), CI = 65814.01 - 72350.25
## SRSWOR : mean = 68519.28  and sd = 1986.877
## 80% Conf Level (alpha = 0.20), CI = 65972.99 - 71065.56 
## 90% Conf Level (alpha = 0.10), CI = 65251.16 - 71787.40
## SRSWOR : mean = 67379.96  and sd = 1986.877
## 80% Conf Level (alpha = 0.20), CI = 64833.68 - 69926.25 
## 90% Conf Level (alpha = 0.10), CI = 64111.84 - 70648.09
## UPSystematic : mean = 78333.09  and sd = 1986.877
## 80% Conf Level (alpha = 0.20), CI = 75786.80 - 80879.37 
## 90% Conf Level (alpha = 0.10), CI = 75064.96 - 81601.21
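
The intervals listed above are consistent with the usual two-sided normal interval, mean ± z · sd, where z is the standard normal quantile at 1 − alpha/2. The sd of 1986.877 reported for each sample equals 19868.77 / √100, which suggests (but does not confirm) a sample size of 100. The Python sketch below recomputes the SRSWR intervals from the printed mean and sd; the function name is hypothetical.

```python
from statistics import NormalDist


def mean_confidence_interval(mean, std_error, confidence):
    """Two-sided normal-based confidence interval for a mean."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. about 1.2816 for 80%
    margin = z * std_error
    return mean - margin, mean + margin


# Values copied from the SRSWR lines of the output above.
srswr_mean, srswr_se = 69082.13, 1986.877

for level in (0.80, 0.90):
    low, high = mean_confidence_interval(srswr_mean, srswr_se, level)
    print(f"{level:.0%} Conf Level, CI = {low:.2f} - {high:.2f}")
```

A higher confidence level uses a larger z value, so the 90% interval is wider than the 80% interval around the same sample mean.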

In-Demand Jobs in the Market

The bar chart below shows the six most popular jobs among H-1B visa applicants and their mean salaries during the year 2016.

State-wise analysis of applications received

The bar plot shows the number of H-1B visa applications received from each of the fifty states in the USA.
